Abstract

We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1, \dots, \pi_k$ it implicitly generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies makes it possible to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.
Given a Markov Decision Process, suppose one runs an approximate version of Value Iteration, that is, one builds a sequence of value-policy pairs as follows:

• Pick any $\pi_{k+1}$ in $\mathcal{G} v_k$
• $v_{k+1} = T_{\pi_{k+1}} v_k + \epsilon_{k+1}$

where $v_0$ is arbitrary, $\mathcal{G} v_k$ is the set of policies that are greedy¹ with respect to $v_k$, $T_{\pi_k}$ is the linear Bellman operator associated to policy $\pi_k$, and $\epsilon_{k+1}$ is the error term of iteration $k+1$. Though it does not appear exactly in this form in the literature, the following performance bound is somewhat standard.

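Theorem 1. Let $\epsilon = \max_{1 \le j < k} \|\epsilon_j\|_{sp}$ be a uniform upper bound on the span seminorm² of the errors before iteration $k$. The loss of policy $\pi_k$ is bounded as follows:

$$\|v_* - v_{\pi_k}\|_\infty \le \frac{1}{1-\gamma}\left(\frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon + \gamma^k \|v_* - v_0\|_{sp}\right). \qquad (1)$$
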
In Theorem 2, we will prove a generalization of this result, so we do not provide a proof here. Since for any $f$, $\|f\|_{sp} \le 2\|f\|_\infty$, Theorem 1 constitutes a slight improvement and a (finite-iteration) generalization of the following well-known performance bound (see [1]):

$$\limsup_{k \to \infty} \|v_* - v_{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \max_k \|\epsilon_k\|_\infty.$$
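
As a rough numerical illustration of the scheme above, the following sketch runs approximate Value Iteration on a small random MDP and compares the loss of the last greedy policy with a finite-iteration bound of the form $\frac{1}{1-\gamma}\left(\frac{\gamma-\gamma^k}{1-\gamma}\epsilon + \gamma^k\|v_* - v_0\|_{sp}\right)$. Everything in it is an assumption made for illustration: the random MDP, the uniform noise model (each $\epsilon_j$ has span at most $\epsilon$), the fixed number of sweeps used to approximate $v_*$ and $v_{\pi_k}$, and all function names.

```python
import random


def make_mdp(n_states, n_actions, rng):
    """Random MDP: P[s][a] is a distribution over next states, R[s][a] a reward."""
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R


def q_value(P, R, gamma, v, s, a):
    """One-step lookahead value of action a in state s with respect to v."""
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))


def greedy(P, R, gamma, v):
    """One policy (an action per state) that is greedy with respect to v."""
    n_actions = len(R[0])
    return [max(range(n_actions), key=lambda a: q_value(P, R, gamma, v, s, a))
            for s in range(len(R))]


def apply_policy(P, R, gamma, pi, v):
    """Apply the linear Bellman operator T_pi to v."""
    return [q_value(P, R, gamma, v, s, pi[s]) for s in range(len(R))]


def run_avi(gamma=0.9, eps=0.1, k=30, seed=0, sweeps=500):
    """Return (loss of pi_k, finite-iteration bound) on a hypothetical random MDP."""
    rng = random.Random(seed)
    P, R = make_mdp(5, 3, rng)
    n = len(R)

    # Approximate v_* by many exact Value Iteration sweeps.
    v_star = [0.0] * n
    for _ in range(sweeps):
        v_star = apply_policy(P, R, gamma, greedy(P, R, gamma, v_star), v_star)

    # Approximate Value Iteration: v_j = T_{pi_j} v_{j-1} + eps_j for j < k.
    v0 = [0.0] * n
    v = v0
    for _ in range(1, k):
        pi = greedy(P, R, gamma, v)
        noise = [rng.uniform(-eps / 2, eps / 2) for _ in range(n)]  # span <= eps
        v = [x + e for x, e in zip(apply_policy(P, R, gamma, pi, v), noise)]
    pi_k = greedy(P, R, gamma, v)  # greedy with respect to v_{k-1}

    # Value of the stationary policy pi_k.
    v_pi = [0.0] * n
    for _ in range(sweeps):
        v_pi = apply_policy(P, R, gamma, pi_k, v_pi)

    loss = max(abs(a - b) for a, b in zip(v_star, v_pi))
    diff = [a - b for a, b in zip(v_star, v0)]
    delta = max(diff) - min(diff)  # span seminorm of v_* - v_0
    bound = ((gamma - gamma ** k) / (1 - gamma) * eps + gamma ** k * delta) / (1 - gamma)
    return loss, bound


if __name__ == "__main__":
    print(run_avi())
```

On random instances such as this one the observed loss is typically far below the bound; the bound is a worst-case guarantee.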



¹ There may be several greedy policies with respect to some value $v$, and what we write here holds whichever one is picked.

² For any function $f$ defined on the state space, the span seminorm of $f$ is $\|f\|_{sp} = \max_s f(s) - \min_s f(s)$. The motivation for using the span seminorm instead of a more usual $L_\infty$-norm is twofold: 1) it slightly improves on the state-of-the-art bounds and 2) it simplifies the construction of an example in the proof of the forthcoming Proposition 1.




Figure 1: The deterministic MDP used in the proof of Proposition 1.

Asymptotically, the above bounds involve a $\frac{\gamma}{(1-\gamma)^2}$ constant that may be very large when $\gamma$ is close to 1. Compared to a value-iteration algorithm for approximately computing the value of some fixed policy, for which one can prove a dependency of the form $\frac{1}{1-\gamma}\epsilon$, there is an extra term $\frac{\gamma}{1-\gamma}$ that suggests that the problem of "computing an approximately optimal policy" is significantly harder than that of "approximately computing the value of some fixed policy". To our knowledge, there does not exist any example in the literature that supports the tightness of the above-mentioned bounds. The following proposition shows that the bound of Theorem 1 is in fact tight.

Proposition 1. For all $\epsilon > 0$, $\Delta > 0$, and $k > 0$, there exists a $(k+1)$-state MDP, an initial value $v_0$ such that $\|v_* - v_0\|_{sp} = \Delta$, and a sequence of noise terms $(\epsilon_j)$ with $\|\epsilon_j\|_{sp} \le \epsilon$, such that running Value Iteration during $k$ iterations with errors $(\epsilon_j)$ outputs a value function $v_{k-1}$ of which a greedy policy $\pi_k$ satisfies Equation (1) with equality.

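Proof. Consider the deterministic MDP of Figure 1. The only decision is in state $s_k$, where one can stay with reward $r = -\frac{\gamma-\gamma^k}{1-\gamma}\epsilon - \gamma^k \Delta$ or move to $s_{k-1}$ with reward 0. All other transitions give reward 0. Thus, there are only two policies, the optimal policy $\pi_*$ with value equal to 0, and a policy $\bar{\pi}$ for which the value in $s_k$ is $\frac{r}{1-\gamma}$. Take

$$v_0(s_l) = \begin{cases} -\Delta & \text{if } l = 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{and for all } j < k, \quad \epsilon_j(s_l) = \begin{cases} -\epsilon & \text{if } l = j \\ 0 & \text{otherwise.} \end{cases}$$

By induction, it can be seen that for all $j \in \{1, 2, \dots, k-1\}$,

$$v_j(s_l) = \begin{cases} -\epsilon - \gamma\epsilon - \dots - \gamma^{j-1}\epsilon - \gamma^j \Delta = -\frac{1-\gamma^j}{1-\gamma}\epsilon - \gamma^j \Delta & \text{if } l = j \\ 0 & \text{if } j < l \le k. \end{cases}$$

Since $\gamma v_{k-1}(s_{k-1}) = r$ and $v_{k-1}(s_k) = 0$, both policies are greedy with respect to $v_{k-1}$, and the bound of Equation (1) holds with equality for $\bar{\pi}$. □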
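
The construction in this proof can be checked numerically. The sketch below is a minimal illustration under our reading of the example: states $s_0, \dots, s_k$ form a chain in which $s_l$ moves to $s_{l-1}$ with reward 0, $s_0$ loops on itself with reward 0 (an assumption about Figure 1), and $s_k$ may either stay with reward $r$ or move to $s_{k-1}$. The function name `tight_example` and its return values are hypothetical.

```python
def tight_example(gamma, eps, delta, k):
    """Run the noisy iterations of the chain example and return
    (q_stay, q_move, loss_of_staying_policy, bound)."""
    # Reward for staying in s_k, as defined in the proof.
    r = -(gamma - gamma ** k) / (1 - gamma) * eps - gamma ** k * delta

    # v_0(s_0) = -delta, v_0(s_l) = 0 otherwise; v[l] stores the value of s_l.
    v = [-delta] + [0.0] * k

    # Iterations j = 1, ..., k-1 with error eps_j = -eps on state s_j only.
    for j in range(1, k):
        new_v = [gamma * v[0]]                                  # s_0 -> s_0
        new_v += [gamma * v[l - 1] for l in range(1, k)]        # s_l -> s_{l-1}
        new_v.append(max(r + gamma * v[k], gamma * v[k - 1]))   # choice in s_k
        new_v[j] -= eps
        v = new_v

    # Action values in s_k with respect to v_{k-1}.
    q_stay = r + gamma * v[k]
    q_move = gamma * v[k - 1]

    # The policy that stays forever in s_k earns r / (1 - gamma); v_* = 0.
    loss = -r / (1 - gamma)
    bound = ((gamma - gamma ** k) / (1 - gamma) * eps + gamma ** k * delta) / (1 - gamma)
    return q_stay, q_move, loss, bound


if __name__ == "__main__":
    print(tight_example(gamma=0.9, eps=0.1, delta=1.0, k=10))
```

Up to floating-point rounding, the two action values in $s_k$ coincide, so the staying policy is indeed greedy with respect to $v_{k-1}$, and its loss matches the bound.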

Instead of running the last stationary policy $\pi_k$, one may consider running a periodic non-stationary policy, which is made of the last $m$ policies. The following theorem shows that it is indeed a good idea.

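Theorem 2. Let $\pi_{k,m}$ be the following policy:

$$\pi_{k,m} = \pi_k \ \pi_{k-1} \ \cdots \ \pi_{k-m+1} \ \pi_k \ \pi_{k-1} \ \cdots$$

Then its performance loss is bounded as follows:

$$\|v_* - v_{\pi_{k,m}}\|_\infty \le \frac{1}{1-\gamma^m}\left(\frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon + \gamma^k \|v_* - v_0\|_{sp}\right).$$
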
When $m = 1$, one exactly recovers the result of Theorem 1. For general $m$, this new bound is a factor $\frac{1-\gamma^m}{1-\gamma}$ better than the usual bound. Taking $m = k$, that is, considering all the policies generated from the very start, one obtains the following bound:

$$\|v_* - v_{\pi_{k,k}}\|_\infty \le \frac{\gamma-\gamma^k}{(1-\gamma)(1-\gamma^k)}\,\epsilon + \frac{\gamma^k}{1-\gamma^k}\|v_* - v_0\|_{sp},$$

that tends to $\frac{\gamma}{1-\gamma}\epsilon$ when $k$ tends to $\infty$. In other words, we can see here that the problem of "computing a (non-stationary) approximately optimal policy" is not harder than that of "computing approximately the value of some fixed policy". Since the respective asymptotic errors are $\frac{\gamma}{1-\gamma}\epsilon$ and $\frac{1}{1-\gamma}\epsilon$, it seems even simpler!
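
To get a feel for the size of these constants, the short sketch below evaluates the asymptotic coefficient of $\epsilon$, namely $\frac{\gamma}{(1-\gamma)(1-\gamma^m)}$, for a few values of $m$, together with the finite-iteration bound. The function names and the chosen value $\gamma = 0.99$ are illustrative assumptions.

```python
def asymptotic_constant(gamma, m):
    """Coefficient of epsilon in the bound for pi_{k,m} when k tends to infinity."""
    return gamma / ((1.0 - gamma) * (1.0 - gamma ** m))


def finite_bound(gamma, eps, delta, k, m):
    """Finite-iteration bound on the loss of pi_{k,m}; m = 1 is the stationary case."""
    numerator = (gamma - gamma ** k) / (1.0 - gamma) * eps + gamma ** k * delta
    return numerator / (1.0 - gamma ** m)


if __name__ == "__main__":
    gamma = 0.99
    for m in (1, 10, 100, 1000):
        print("m =", m, "constant =", asymptotic_constant(gamma, m))
    # Guarantee for approximately evaluating a fixed policy, for comparison.
    print("policy evaluation constant =", 1.0 / (1.0 - gamma))
```

With $m = 1$ the constant is $\frac{\gamma}{(1-\gamma)^2}$, and it decreases towards $\frac{\gamma}{1-\gamma}$ as $m$ grows, which is below the $\frac{1}{1-\gamma}$ guarantee of policy evaluation.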

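Proof of Theorem 2. The value of $\pi_{k,m}$ satisfies:

$$v_{\pi_{k,m}} = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-m+1}} v_{\pi_{k,m}}. \qquad (2)$$

By induction, it can be shown that the sequence of values generated by the algorithm satisfies:

$$T_{\pi_k} v_{k-1} = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-m+1}} v_{k-m} + \sum_{i=1}^{m-1} \Gamma_{k,i}\, \epsilon_{k-i} \qquad (3)$$

where

$$\Gamma_{k,i} = \gamma^i P_{\pi_k} P_{\pi_{k-1}} \cdots P_{\pi_{k-i+1}}$$

in which, for all $\pi$, $P_\pi$ denotes the stochastic matrix associated to policy $\pi$. By subtracting Equation (2) from Equation (3), one obtains:

$$T_{\pi_k} v_{k-1} - v_{\pi_{k,m}} = \Gamma_{k,m}\,(v_{k-m} - v_{\pi_{k,m}}) + \sum_{i=1}^{m-1} \Gamma_{k,i}\, \epsilon_{k-i}$$
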

and by taking the norm

$$\|T_{\pi_k} v_{k-1} - v_{\pi_{k,m}}\|_\infty \le \gamma^m \|v_{k-m} - v_{\pi_{k,m}}\|_\infty + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon_\infty \qquad (4)$$

where $\epsilon_\infty = \max_{1 \le j < k} \|\epsilon_j\|_\infty$. Essentially, Equation (4) shows that for sufficiently big $m$, $T_{\pi_k} v_{k-1}$ is a $\frac{\gamma}{1-\gamma}\epsilon_\infty$-approximation of the value of the non-stationary policy $\pi_{k,m}$ (whereas in general, it may be a much poorer approximation of the value of the stationary policy $\pi_k$).
By induction, it can also be proved that

$$\|v_* - v_k\|_\infty \le \gamma^k \|v_* - v_0\|_\infty + \frac{1-\gamma^k}{1-\gamma}\,\epsilon_\infty. \qquad (5)$$

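Using the fact that $\|T_{\pi_*} v_* - T_{\pi_k} v_{k-1}\|_\infty \le \gamma \|v_* - v_{k-1}\|_\infty$ since $\pi_*$ (resp. $\pi_k$) is greedy with respect to $v_*$ (resp. $v_{k-1}$), as well as Equations (4) and (5), we can conclude by observing that

$$\begin{aligned}
\|v_* - v_{\pi_{k,m}}\|_\infty
&\le \|T_{\pi_*} v_* - T_{\pi_k} v_{k-1}\|_\infty + \|T_{\pi_k} v_{k-1} - v_{\pi_{k,m}}\|_\infty \\
&\le \gamma \|v_* - v_{k-1}\|_\infty + \gamma^m \|v_{k-m} - v_{\pi_{k,m}}\|_\infty + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon_\infty \\
&\le \gamma \left(\gamma^{k-1} \|v_* - v_0\|_\infty + \frac{1-\gamma^{k-1}}{1-\gamma}\,\epsilon_\infty\right) + \gamma^m \left(\|v_{k-m} - v_*\|_\infty + \|v_* - v_{\pi_{k,m}}\|_\infty\right) + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon_\infty \\
&\le \gamma^k \|v_* - v_0\|_\infty + \frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon_\infty + \gamma^m \left(\gamma^{k-m} \|v_* - v_0\|_\infty + \frac{1-\gamma^{k-m}}{1-\gamma}\,\epsilon_\infty + \|v_* - v_{\pi_{k,m}}\|_\infty\right) + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon_\infty \\
&= \gamma^m \|v_* - v_{\pi_{k,m}}\|_\infty + 2\gamma^k \|v_* - v_0\|_\infty + \frac{2(\gamma-\gamma^k)}{1-\gamma}\,\epsilon_\infty.
\end{aligned}$$

Moving the term $\gamma^m \|v_* - v_{\pi_{k,m}}\|_\infty$ to the left-hand side and dividing by $1-\gamma^m$ gives a bound in terms of sup norms. Adding a constant to the value $v_j$ at any step $j$ of the algorithm does not affect the greedy policy set $\mathcal{G} v_j$ and only adds a constant to the next value $v_{j+1}$. As a consequence, we can assume without loss of generality that $\|v_* - v_0\|_{sp} = 2\|v_* - v_0\|_\infty$ and $\|\epsilon_j\|_{sp} = 2\|\epsilon_j\|_\infty$, and the result follows. □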
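
The fixed-point characterization of $v_{\pi_{k,m}}$ used at the start of this proof also gives a simple way to evaluate the periodic policy numerically. The sketch below is an illustration under stated assumptions, not a reference implementation: it uses a hypothetical random MDP, uniform noise with span at most $\epsilon$, $v_0 = 0$, and a fixed number of sweeps in place of exact fixed points. It compares the loss of $\pi_{k,m}$ with $\frac{1}{1-\gamma^m}\left(\frac{\gamma-\gamma^k}{1-\gamma}\epsilon + \gamma^k\|v_* - v_0\|_{sp}\right)$ for a few values of $m$.

```python
import random


def make_mdp(n_states, n_actions, rng):
    """Random MDP: P[s][a] is a distribution over next states, R[s][a] a reward."""
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R


def q_value(P, R, gamma, v, s, a):
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))


def greedy(P, R, gamma, v):
    """One policy that is greedy with respect to v."""
    n_actions = len(R[0])
    return [max(range(n_actions), key=lambda a: q_value(P, R, gamma, v, s, a))
            for s in range(len(R))]


def apply_policy(P, R, gamma, pi, v):
    """Apply the linear Bellman operator T_pi to v."""
    return [q_value(P, R, gamma, v, s, pi[s]) for s in range(len(R))]


def evaluate_cycle(P, R, gamma, cycle, sweeps=300):
    """Value of the periodic policy that plays cycle[-1] first, then cycle[-2], ...

    Iterates v <- T_{cycle[-1]} ... T_{cycle[0]} v, whose fixed point is that value.
    """
    v = [0.0] * len(R)
    for _ in range(sweeps):
        for pi in cycle:  # innermost operator first, T_{pi_k} applied last
            v = apply_policy(P, R, gamma, pi, v)
    return v


def compare(gamma=0.9, eps=0.1, k=20, seed=1):
    """Return {m: (loss of pi_{k,m}, bound)} for m in (1, 5, k)."""
    rng = random.Random(seed)
    P, R = make_mdp(6, 3, rng)
    n = len(R)

    v_star = [0.0] * n
    for _ in range(500):
        v_star = apply_policy(P, R, gamma, greedy(P, R, gamma, v_star), v_star)

    v = [0.0] * n                       # v_0 = 0
    delta = max(v_star) - min(v_star)   # span of v_* - v_0 since v_0 = 0
    policies = []                       # pi_1, ..., pi_k
    for j in range(1, k + 1):
        pi = greedy(P, R, gamma, v)     # pi_j is greedy with respect to v_{j-1}
        policies.append(pi)
        if j < k:
            noise = [rng.uniform(-eps / 2, eps / 2) for _ in range(n)]
            v = [x + e for x, e in zip(apply_policy(P, R, gamma, pi, v), noise)]

    numerator = (gamma - gamma ** k) / (1 - gamma) * eps + gamma ** k * delta
    results = {}
    for m in (1, 5, k):
        v_cycle = evaluate_cycle(P, R, gamma, policies[-m:])
        loss = max(abs(a - b) for a, b in zip(v_star, v_cycle))
        results[m] = (loss, numerator / (1 - gamma ** m))
    return results


if __name__ == "__main__":
    for m, (loss, bound) in compare().items():
        print("m =", m, "loss =", loss, "bound =", bound)
```

The bound shrinks as $m$ grows; the observed losses on a random instance need not be monotone in $m$, since the theorem only controls the worst case.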



From a bibliographical point of view, the idea of using non-stationary policies to improve error bounds already appears in [2]. However, in that work, the author considers finite-horizon problems where the policy to be computed is naturally non-stationary. The fact that non-stationary policies (that loop over the last $m$ computed policies) can also be useful in an infinite-horizon context is, to our knowledge, new.